Search and download of transcriptomic data from the GEO database

Author

Luciano Anselmino

Where to search for human gene expression data?

Human gene expression data are available from several major public databases and repositories that host a wide range of gene expression datasets.

In this workflow, we will focus on the GEO database, as the data selected for the study were specifically downloaded from there.

About GEO

GEO (Gene Expression Omnibus) is a public repository for genomic data. Its primary purpose is to store and share data from gene expression experiments, where the activity of genes is analyzed under different conditions or cell types. GEO serves as a centralized source for researchers to share their data with the scientific community, allowing others to reuse this data in new analyses or comparative studies.

GEO stores various types of biological data, including:

  1. Gene expression data: Information about gene expression levels under different conditions, such as tissues, cell types, cell lines, or in response to specific treatments. This includes data from microarrays and RNA-seq.

  2. DNA and RNA sequencing data: Results from sequencing experiments, including RNA-seq (transcriptomes), ChIP-seq (protein-DNA interactions), and ATAC-seq (chromatin accessibility).

  3. Genotyping and genetic variant data: Information about genomic variations, such as single nucleotide polymorphisms (SNPs), structural variants, and mutations, obtained from GWAS studies or other approaches.

  4. Metadata: Descriptions of samples, including biological characteristics (e.g., age, sex, or disease state), experimental conditions (treatments, exposure times), and protocols used.

  5. Epigenetic profiles: Data related to epigenetic modifications, such as DNA methylation patterns or histone modifications.

  6. Molecular interaction data: Information from experiments like RIP-seq or CLIP-seq that identify interactions between molecules, such as RNA-RNA or protein-RNA interactions.

  7. Raw count matrices: Files containing raw data from sequencing or microarray experiments, allowing for customized analyses.

  8. Chromatin accessibility profiles: Data from assays like DNase-seq or ATAC-seq, revealing accessible regions of the genome.

How Data is Stored in GEO (Under Construction 🚧)

About GEOquery

GEOquery is a package in the R programming language that facilitates downloading and manipulating datasets from the GEO database. It is used to access various types of genomic data, such as gene expression matrices, time series data, and study metadata, directly from R, without the need to manually download them from the GEO website.

First, you need to install the package in R if you haven’t already. This can be done by running:

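if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager") # Install BiocManager only if it is missing
}
BiocManager::install("GEOquery")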

Then, you need to load GEOquery in your R session:

library(GEOquery)

To download data from GEO, you need to know the dataset identifier (e.g., “GSEXXXX”). You can read about how GEO organizes its information [here] and also learn how to perform a data search in GEO [here]. Once you have chosen your dataset, you can use the getGEO function to fetch it.

gse <- getGEO("GSEXXXX", GSEMatrix = TRUE) # Here, GSEXXXX is your selected GEO series, for example, GSE39582

Here, GSEMatrix = TRUE tells getGEO to download the series matrix file(s) of the dataset and return the expression data in matrix form (wrapped in ExpressionSet objects), which is the format used in many analyses.
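Because a mistyped identifier only fails once getGEO has tried to contact GEO, it can help to check the identifier first. The following is a minimal sketch of one way to do that. The names is_gse_id and fetch_gse are hypothetical helpers, not part of GEOquery, and the check assumes that a series accession is "GSE" followed by digits. destdir is an existing getGEO argument that sets where the downloaded files are written.

```r
# Hypothetical helper: TRUE when `id` looks like a GEO series accession.
# Assumption: series accessions are "GSE" followed by one or more digits.
is_gse_id <- function(id) {
  is.character(id) && length(id) == 1 && !is.na(id) && grepl("^GSE[0-9]+$", id)
}

# Hypothetical wrapper: validate the accession, then download it with getGEO().
# The downloaded files are written to `destdir`.
fetch_gse <- function(id, destdir = tempdir()) {
  if (!is_gse_id(id)) {
    stop("Not a GEO series accession: ", id)
  }
  GEOquery::getGEO(id, GSEMatrix = TRUE, destdir = destdir)
}

is_gse_id("GSE39582")   # TRUE
is_gse_id("GSM972020")  # FALSE: this is a sample accession, not a series

# gse <- fetch_gse("GSE39582")  # uncomment to download (needs internet access)
```

The download call is left commented out so that the block can be run without a network connection.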

What do I download when I use the GEOquery getGEO function?

With GSEMatrix = TRUE, getGEO returns the gene expression data as a list of objects of a data structure called “ExpressionSet”, which is an S4 class. An S4 object is a way to organize and work with data in R. Think of it as a box that can hold different types of data and knows what to do with them based on how it is labeled (its “class”). Inside this box, you can have various components, such as expression data and metadata (clinical data, data about how the experiment was performed, etc.). The label (or “class”) of the box tells R what kind of data it is handling and which functions to use to work with it. To work with S4 objects, you use generic functions and methods specific to those objects. Here are some examples:

Data Access:

exprs(): retrieves the gene expression matrix.
pData(): retrieves information about each sample (metadata).
fData(): retrieves information about the features (probes or genes) measured by the technology (chip or sequencer).
experimentData(): retrieves general metadata about the experiment as a whole, such as its title, a summary of the study, and contact details.
annotation(): retrieves the identifier of the platform used.
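Before turning to real GEO data, the “labeled box” idea can be illustrated with a toy S4 class built in base R. This is only a sketch: ToyExpressionSet and toyExprs are hypothetical names invented for the illustration, the values are made up, and the real ExpressionSet class is considerably richer.

```r
library(methods)  # S4 machinery that ships with R

# A toy S4 class (hypothetical, much simpler than ExpressionSet):
# one slot for the expression matrix, one for the sample metadata.
setClass(
  "ToyExpressionSet",
  slots = c(expression = "matrix", samples = "data.frame")
)

# A generic function: R chooses the method to run from the class of `object`.
setGeneric("toyExprs", function(object) standardGeneric("toyExprs"))
setMethod("toyExprs", "ToyExpressionSet", function(object) object@expression)

# Build a small object with made-up values.
toy <- new(
  "ToyExpressionSet",
  expression = matrix(
    c(5.1, 7.3, 6.0, 8.2), nrow = 2,
    dimnames = list(c("probe_1", "probe_2"), c("sample_A", "sample_B"))
  ),
  samples = data.frame(sex = c("F", "M"), row.names = c("sample_A", "sample_B"))
)

isS4(toy)                    # TRUE: the object is an S4 "box"
is(toy, "ToyExpressionSet")  # TRUE: this is the "label" of the box
toyExprs(toy)                # the accessor returns the matrix stored inside
```

The accessor toyExprs plays the role that exprs() plays for a real ExpressionSet: the user asks for a component through a function instead of reaching into the object directly.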

Let’s see this with code:

library(GEOquery)
library(Biobase) # Provides the ExpressionSet class and its accessor functions

# Download data from GEO
gse <- getGEO("GSEXXXX", GSEMatrix = TRUE) #Here, GSEXXXX is your selected GEO series, for example, GSE39582.

# getGEO returns a list of ExpressionSet objects, one per platform used in the series. Most series use a single platform, so the list usually has one element.
gse_data <- gse[[1]] # For a single-platform series, "gse_data <- gse[[2]]" fails with a "subscript out of bounds" error

# Access the expression matrix
expression_data <- exprs(gse_data)

# Access the metadata
metadata <- pData(gse_data)

# Access the feature data
feature_data <- fData(gse_data) # Usually, here you can find information about the probes, for example, the gene symbol associated with the probe, the ENTREZ_ID and the gene ontology.

# Access the general information about the experiment
experiment_data <- experimentData(gse_data) # Here, you can find information about, for example, the last date the dataset was modified or the email of the contact person for questions.

# Access the platform name
platform <- annotation(gse_data)

In the image above, you can observe a schematic representation of how an object of the ExpressionSet type is structured.
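Within an ExpressionSet, the columns of the expression matrix and the rows of the sample metadata describe the same samples, which is what makes it possible to select samples by their metadata. The sketch below illustrates this with small toy objects standing in for the results of exprs() and pData(). All values are made up, and the column name group is a hypothetical example, since the metadata columns differ from one dataset to another.

```r
# Toy stand-ins for exprs(gse_data) and pData(gse_data); all values are made up.
expression_data <- matrix(
  c(7.1, 3.2, 6.8, 3.0, 9.4, 2.9),
  nrow = 2,
  dimnames = list(c("probe_1", "probe_2"), c("GSM1", "GSM2", "GSM3"))
)
metadata <- data.frame(
  group = c("tumor", "normal", "tumor"),  # hypothetical column name
  row.names = c("GSM1", "GSM2", "GSM3")
)

# Columns of the matrix and rows of the metadata should describe the same
# samples, in the same order.
stopifnot(identical(colnames(expression_data), rownames(metadata)))

# Use the metadata to select samples in the expression matrix.
tumor_ids <- rownames(metadata)[metadata$group == "tumor"]
tumor_expression <- expression_data[, tumor_ids, drop = FALSE]
tumor_expression
```

Using drop = FALSE keeps the result a matrix even when only one sample is selected.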

How is the gene expression matrix organized?

The information in the expression matrix will depend on the technique used to obtain the gene expression values. The two most common technologies today are microarrays (chips) and RNA sequencing (RNA-seq or high-throughput platforms).

When you execute expression_data <- exprs(gse_data), you store the expression matrix in an object named “expression_data”.

In the case of microarrays, this object has the following characteristics:

  • Probes are in rows: a probe is a small DNA (or RNA) sequence designed to bind (hybridize) to a specific sequence of interest. Each gene is typically represented by multiple probes. If you want to know more about how microarrays work, I recommend this amazing video (link). The important thing here is that the probe names in the rows are written with a chip-specific code.

  • Samples are in columns: each column of the expression matrix represents a patient or a sample. The samples also have a specific code, whose structure depends on the researchers performing the analysis. In the case of the GEO database, the samples always begin with “GSM” followed by a number, for example, GSM972020. But keep in mind that if you plan to download gene expression data from other databases, the sample codes might be very different.

  • The expression values are in the cells: for microarrays, the gene expression values represent the fluorescence (light intensity) emitted where the labeled RNA from the sample hybridizes with its specific probe on the chip.
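Because a gene can be represented by several probes, gene-level analyses usually need to translate probe codes into genes and then summarize the probes of each gene. The sketch below shows one simple option, averaging the probes of each gene, on a toy matrix. The probe names, gene names, and values are all made up, and averaging is only one of several possible summaries.

```r
# Toy microarray matrix: probes in rows, GEO samples in columns (made-up values).
expression_data <- matrix(
  c(8.0, 6.0, 4.0,
    9.0, 7.0, 5.0),
  nrow = 3,
  dimnames = list(c("probe_1", "probe_2", "probe_3"), c("GSM1", "GSM2"))
)

# Hypothetical probe-to-gene mapping; with real data this information
# usually comes from the feature data, fData(gse_data).
probe_to_gene <- c(probe_1 = "GENE_A", probe_2 = "GENE_A", probe_3 = "GENE_B")

# Sum the probes of each gene, then divide by the number of probes per gene
# to obtain one averaged row per gene.
genes <- probe_to_gene[rownames(expression_data)]
gene_sums <- rowsum(expression_data, group = genes)
gene_expression <- gene_sums / as.vector(table(genes)[rownames(gene_sums)])
gene_expression
```

In this toy example, GENE_A is measured by two probes and ends up as a single row holding their mean, while GENE_B keeps the values of its only probe.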

In the case of RNA-seq, the object has a few differences:

  • Gene names or codes are in rows (there are no probes because the technique relies on the direct sequencing of RNA fragments). Usually, you can find official gene names, Entrez IDs, or Ensembl IDs in the rows.

  • The columns continue to be the patients or samples.

  • The values in the cells are now “read counts”, which represent the number of RNA fragments that align with a specific region of the genome or a particular gene. They are an approximate measure of the abundance of the RNA representing a gene in the sample.
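One reason read counts are only an approximate measure of abundance is that they also depend on how deeply each sample was sequenced. The sketch below illustrates this on a toy count matrix by computing the total number of reads per sample and rescaling the counts to counts per million (CPM). The gene names and values are made up, and CPM is shown only as one common, simple rescaling, not as the method used in this workflow.

```r
# Toy RNA-seq count matrix: genes in rows, samples in columns (made-up values).
counts <- matrix(
  c(100L, 300L, 600L,
    200L, 600L, 1200L),
  nrow = 3,
  dimnames = list(c("GENE_A", "GENE_B", "GENE_C"), c("GSM1", "GSM2"))
)

# Library size: total number of counted reads per sample.
library_sizes <- colSums(counts)

# Counts per million: divide each column by its library size, then scale.
cpm <- sweep(counts, 2, library_sizes, FUN = "/") * 1e6
cpm
```

In this toy example, the second sample has twice as many reads for every gene simply because it was sequenced twice as deeply, so both samples have identical CPM values.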